Web Page Quality Estimation Based on Linear Discriminant Function
Abstract
With the growth of web data, estimating web page quality effectively and rapidly is becoming increasingly important for web information retrieval and knowledge discovery. This paper analyzes the differences between retrieval target pages and ordinary pages using query-independent features. Using these features, an algorithm called Linear Page Estimation (LPE) is proposed for web page quality estimation. In experiments on the .GOV corpus and SOGOU corpus involving 26 million pages, our algorithm removes about 95% of pages while retaining more than 90% of retrieval target pages. Experimental results on TREC datasets also show that retrieval performance on collections selected by our algorithm can be close to, or even better than, that on the whole collection.
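The abstract does not give the details of LPE, so the following is only a minimal sketch of the general idea it names: a linear discriminant function over query-independent features, used to decide which pages to keep. It uses a standard Fisher linear discriminant as a stand-in, not the paper's actual method. The feature columns (log in-link count, URL depth, log document length), the toy data, and the function names `fit_linear_discriminant` and `select_pages` are all hypothetical.

```python
import numpy as np

def fit_linear_discriminant(targets, ordinary):
    """Fit a Fisher linear discriminant separating target pages from ordinary pages.

    Returns a weight vector w and a threshold placed at the midpoint of the
    two projected class means. This is a generic stand-in, not the paper's LPE.
    """
    targets = np.asarray(targets, dtype=float)
    ordinary = np.asarray(ordinary, dtype=float)
    mu_t = targets.mean(axis=0)
    mu_o = ordinary.mean(axis=0)
    # Within-class scatter matrix, summed over both classes.
    s_w = (targets - mu_t).T @ (targets - mu_t) + (ordinary - mu_o).T @ (ordinary - mu_o)
    # Small ridge term keeps the matrix invertible on tiny samples.
    s_w += 1e-6 * np.eye(s_w.shape[0])
    w = np.linalg.solve(s_w, mu_t - mu_o)
    threshold = 0.5 * (w @ mu_t + w @ mu_o)
    return w, threshold

def select_pages(features, w, threshold):
    """Return a boolean mask: True for pages whose linear score exceeds the threshold."""
    return np.asarray(features, dtype=float) @ w > threshold

# Hypothetical toy features: [log in-link count, URL depth, log document length].
target_pages = [[5.0, 1.0, 7.0], [6.0, 2.0, 8.0], [7.0, 1.0, 7.5], [6.0, 1.0, 8.5]]
ordinary_pages = [[1.0, 4.0, 5.0], [0.0, 5.0, 6.0], [2.0, 6.0, 4.5], [1.0, 5.0, 5.5]]

w, threshold = fit_linear_discriminant(target_pages, ordinary_pages)
keep_mask = select_pages(target_pages + ordinary_pages, w, threshold)
print(keep_mask)
```

In a setting like the one the abstract describes, the threshold would presumably be tuned to trade collection size against the share of retrieval target pages retained, rather than fixed at the midpoint as in this sketch.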
Similar resources
Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST
Abstract: MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that ...
Information Quality of Commercial Web Site Home Pages: An Explorative Analysis
In the search for substantive relationships in the use of emerging technology, information quality is often difficult to assess. This research explores user perceptions of presentation, navigation, and quality of Web home pages for approximately 200 selected Fortune 500 companies across 10 industries. An instrument is developed to measure these constructs and is assessed for convergent and disc...
Scott Nicholson - Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Work
Nicholson, S. (2003). Bibliomining for automated collection development in a digital library setting: Using data mining to discover web-based scholarly research works. Abstract: This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came fr...
Direct Multi-label Linear Discriminant Analysis
Multi-label problems arise in different domains such as digital media analysis and description, text categorization, multi-topic web page categorization, and image and video annotation. Such a situation arises when the data are associated with multiple labels simultaneously. Like single-label problems, multi-label problems also suffer from high dimensionality as multi-label data often hap...
Designing a Volunteer Geographic Information-based service for rapid earth quake damages estimation
Introduction: The advent of Web 2.0 enables users to interact and prepare free, unlimited real-time data. This advantage leads us to exploit Volunteer Geographic Information (VGI) for real-time crisis management. Traditional estimation methods for earthquake damages are expensive and tim...
Publication date: 2007